



Pseudo codes

Neural Information Processing Systems

Note that we don't validate the inner-loop's λ at every outer-loop iteration, but keep changing it on the fly at each validation cycle.


A Data Collection and Details about the

Neural Information Processing Systems

We collected about 30 million text-image pairs from multiple channels and built a new 2.5TB dataset (about 250GB after tokenization). The data sources fall broadly into the following categories: (1) Professional image websites (both English and Chinese). Images on these websites usually come with captions. We have already introduced tokenizers in section 2.2, and here are some details. Colored grids are all the tokens attended to by the token marked "O".


4fc81f4cd2715d995018e0799262176b-Supplemental-Conference.pdf

Neural Information Processing Systems

Two other important techniques are mixed precision training [36] and in-place activated BatchNorm [53]. Mixed precision training involves training using both 32-bit and 16-bit IEEE floating point numbers depending on the numerical sensitivity of different layers [36].


Predicting Training Time Without Training Supplementary Material

Neural Information Processing Systems

In both cases we observe that the predicted curve is reasonably close to the actual curve, more so at the beginning of the training (which is expected, since the linear approximation is more likely to hold). Point-wise similarity of predicted and observed loss curve. Up to now we focused on prediction error rates (see e.g. We started defining training time as the first time the (smoothed) loss is below a given threshold (which we then normalized w.r.t. In Section 4 we suggest that, in the case of MSE loss, it is possible to predict the training time on a large dataset using a subset of the samples. However, since our training time definition measures the time to reach the asymptotic value (which is what is most useful in practice) rather than the time to reach an absolute threshold, this does not affect the accuracy of the prediction (see Appendix C).
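The definition of training time quoted above (the first time the smoothed loss falls below a given threshold) can be illustrated with a minimal sketch. The excerpt does not say how the loss is smoothed or how the threshold is normalized, so the bias-corrected exponential moving average, the `beta` parameter, and the function names `smooth` and `training_time` are all assumptions made for illustration only.

```python
from typing import List, Optional, Sequence


def smooth(losses: Sequence[float], beta: float = 0.9) -> List[float]:
    """Bias-corrected exponential moving average (an assumed smoothing choice)."""
    smoothed = []
    avg = 0.0
    for step, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        # Divide by (1 - beta^step) so early values are not biased toward zero.
        smoothed.append(avg / (1.0 - beta ** step))
    return smoothed


def training_time(
    losses: Sequence[float], threshold: float, beta: float = 0.9
) -> Optional[int]:
    """Return the first step index whose smoothed loss is below `threshold`.

    Returns None if the smoothed loss never drops below the threshold.
    """
    for step, value in enumerate(smooth(losses, beta)):
        if value < threshold:
            return step
    return None
```

Applied to a predicted and an observed loss curve with the same threshold, the two returned step indices could then be compared directly; how the threshold is chosen is left open here because the excerpt truncates that detail.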



A Proof of Lemma 1 According to the second condition in (8), we have q (x) = q (x

Neural Information Processing Systems

Therefore, it fails to control the false positive rate. Figure 10: Distribution of naive p-value when the null hypothesis is true. Figure 11: Distribution of selective p-value when the null hypothesis is true. Figure 12: Uniform QQ-plot of the pivot. In the above example, we used 3 cuts (pieces) to approximate the function. In Figure 13, we show that # encountered intervals still increases linearly in practice. Figure 13: Demonstration of # encountered and # truncation intervals when increasing # cuts (pieces).
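As a rough illustration of what a uniform QQ-plot checks, the sketch below pairs sorted p-values with the quantiles of the uniform distribution on [0, 1] and reports the largest gap between them. It uses an ordinary two-sided z-test on simulated standard normal data as a stand-in; it does not implement the naive or selective p-values of the excerpt, and the function names, sample size, and seed are hypothetical choices.

```python
import math
import random


def two_sided_p_value(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def uniform_qq_points(p_values):
    """Pair each sorted p-value with its theoretical Uniform(0, 1) quantile."""
    n = len(p_values)
    ordered = sorted(p_values)
    return [((i + 0.5) / n, p) for i, p in enumerate(ordered)]


def max_qq_deviation(p_values):
    """Largest vertical distance between the QQ points and the diagonal."""
    return max(abs(t - p) for t, p in uniform_qq_points(p_values))


# Simulate test statistics under the null hypothesis (assumed setup).
rng = random.Random(0)
null_p_values = [two_sided_p_value(rng.gauss(0.0, 1.0)) for _ in range(2000)]
deviation = max_qq_deviation(null_p_values)
```

A valid p-value is uniformly distributed under the null hypothesis, so its QQ points lie close to the diagonal and `deviation` stays small; a p-value that fails to control the false positive rate would bend away from the diagonal.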